Introduction

A data set covering a range of variables that could affect students’ academic performance has been chosen for this report.

This report uses the chosen data set to answer three questions.

  1. Which variables have the least and the greatest impact on a student’s academic performance?
  2. Does the relationship status of a student’s family have an impact on their grades? If so, how do they relate?
  3. What is the correlation between alcohol consumption and variables that impact academic performance, such as hours spent studying, number of absences, and number of class failures?

Data

The data set chosen was published by Aman Chauhan on Kaggle: https://www.kaggle.com/datasets/whenamancodes/alcohol-effects-on-study?datasetId=2479552

It is called “Alcohol Effects On Study”, but it also has many other variables describing students.

The data set describes the performance of students from two Portuguese schools in a Mathematics class and a Portuguese class, together with a variety of student attributes. These attributes include gender, parents’ cohabitation status, alcohol consumption, activities, and so on. We have chosen to use only the mathematics data set, as mathematics is studied globally whereas Portuguese is not.

Table 1. First 6 Rows of Data Set
school sex age address famsize Pstatus Medu Fedu
GP F 18 U GT3 A 4 4
GP F 17 U GT3 T 1 1
GP F 15 U LE3 T 1 1
GP F 15 U GT3 T 4 2
GP F 16 U GT3 T 3 3
GP M 16 U LE3 T 4 3
Mjob Fjob reason guardian traveltime studytime failures schoolsup
at_home teacher course mother 2 2 0 yes
at_home other course father 1 2 0 no
at_home other other mother 1 2 3 yes
health services home mother 1 3 0 no
other other home father 1 2 0 no
services other reputation mother 1 2 0 no
famsup paid activities nursery higher internet romantic famrel
no no no yes yes no no 4
yes no no no yes yes no 5
no yes no yes yes yes no 4
yes yes yes yes yes yes yes 3
yes yes no yes yes no no 4
yes yes yes yes yes yes no 5
freetime goout Dalc Walc health absences G1 G2 G3
3 4 1 1 3 6 5 6 6
3 3 1 1 3 4 5 5 6
3 2 2 3 3 10 7 8 10
2 2 1 1 5 2 15 14 15
3 2 1 2 5 4 6 10 10
4 2 1 2 5 10 15 15 15
Table 2. Attribute Types and Descriptions
Variable Name Type Description & Sample Space
school categorical Students’ school: [GP, MS]
sex categorical Students’ sex: [F, M]
age quantitative Students’ age: [15-22]
address categorical Students’ home address type: [U, R]
famsize categorical Students’ family size: [LE3, GT3]
Pstatus categorical Parents’ cohabitation status: [T, A]
Medu categorical Mother’s education: [0:4]
Fedu categorical Father’s education: [0:4]
Mjob categorical Mother’s job
Fjob categorical Father’s job
reason categorical Why students chose the school
guardian categorical Students’ guardian
traveltime categorical Travel time from home to school: [1:4]
studytime categorical Weekly study time: [1:4]
failures categorical Number of past class failures: [0:4]
schoolsup categorical Extra educational support: [yes, no]
famsup categorical Family educational support: [yes, no]
paid categorical Extra paid classes: [yes, no]
activities categorical Extracurricular activities: [yes, no]
nursery categorical Attended nursery school: [yes, no]
higher categorical Interested in higher education: [yes, no]
internet categorical Availability of internet at home: [yes, no]
romantic categorical In a romantic relationship: [yes, no]
famrel categorical Family relationship quality: [1:5]
freetime categorical Free time after school: [1:5]
goout categorical Goes out with friends: [1:5]
Dalc categorical Alcohol consumption on workdays: [1:5]
Walc categorical Alcohol consumption on weekends: [1:5]
health categorical Student’s health: [1:5]
absences quantitative Number of school absences: 0~93
G1 quantitative First period grade: 0~20
G2 quantitative Second period grade: 0~20
G3 quantitative Final grade: 0~20

There are 30 explanatory variables and 3 target variables (G1, G2, G3). According to the author of the data, the three grades are strongly correlated with one another.

From the histogram, apart from the outliers where the grade is 0, the grades are approximately normally distributed.

Here are some other statistics on G3.

Table 3. Summary of G3
Values
Variable Name G3
Mean 10.42
Minimum Value 0
Q1 8
Median Value 11
Q3 14
Maximum Value 20
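
The statistics in Table 3 can be computed in a few lines. The following is a minimal Python sketch (the report does not show its own code), applied only to the six G3 values visible in Table 1, so its numbers differ from Table 3. The helper name `summarize` and the choice of the "inclusive" quartile method are assumptions made for illustration.

```python
import statistics

def summarize(values):
    """Return mean, minimum, quartiles, and maximum of a list of numbers."""
    # "inclusive" interpolates between observed values (an assumed choice).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": round(statistics.mean(values), 2),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }

# G3 values of the six students shown in Table 1 (not the full 395 rows).
g3_sample = [6, 6, 10, 15, 10, 15]
summary = summarize(g3_sample)
print(summary)
```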

Table 4. Summary of Data Set

Data summary
Name maths_study
Number of rows 395
Number of columns 33
_______________________
Column type frequency:
character 17
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
school 0 1 2 2 0 2 0
sex 0 1 1 1 0 2 0
address 0 1 1 1 0 2 0
famsize 0 1 3 3 0 2 0
Pstatus 0 1 1 1 0 2 0
Mjob 0 1 5 8 0 5 0
Fjob 0 1 5 8 0 5 0
reason 0 1 4 10 0 4 0
guardian 0 1 5 6 0 3 0
schoolsup 0 1 2 3 0 2 0
famsup 0 1 2 3 0 2 0
paid 0 1 2 3 0 2 0
activities 0 1 2 3 0 2 0
nursery 0 1 2 3 0 2 0
higher 0 1 2 3 0 2 0
internet 0 1 2 3 0 2 0
romantic 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 16.70 1.28 15 16 17 18 22 ▇▅▅▁▁
Medu 0 1 2.75 1.09 0 2 3 4 4 ▁▃▆▆▇
Fedu 0 1 2.52 1.09 0 2 2 3 4 ▁▆▇▇▇
traveltime 0 1 1.45 0.70 1 1 1 2 4 ▇▃▁▁▁
studytime 0 1 2.04 0.84 1 1 2 2 4 ▅▇▁▂▁
failures 0 1 0.33 0.74 0 0 0 0 3 ▇▁▁▁▁
famrel 0 1 3.94 0.90 1 4 4 5 5 ▁▁▃▇▅
freetime 0 1 3.24 1.00 1 3 3 4 5 ▁▃▇▆▂
goout 0 1 3.11 1.11 1 2 3 4 5 ▂▆▇▅▃
Dalc 0 1 1.48 0.89 1 1 1 2 5 ▇▂▁▁▁
Walc 0 1 2.29 1.29 1 1 2 3 5 ▇▅▅▃▂
health 0 1 3.55 1.39 1 3 4 5 5 ▂▂▅▃▇
absences 0 1 5.71 8.00 0 0 4 8 75 ▇▁▁▁▁
G1 0 1 10.91 3.32 3 8 11 13 19 ▂▇▇▆▂
G2 0 1 10.71 3.76 0 9 11 13 19 ▁▂▇▆▂
G3 0 1 10.42 4.58 0 8 11 14 20 ▂▃▇▅▁

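The data set has no missing values. Because most variables are categorical with only a few levels, outliers are not a major concern, although absences has a long right tail (maximum 75) and some final grades are 0. Cardinalities look reasonable, and no irregular cardinality was noticed.

Therefore, the analysis was carried out on the data without any preprocessing.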
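
A check for missing values and cardinalities of this kind can be sketched as follows. This is an illustrative Python sketch, not the code used for Table 4; the helper `column_report` is hypothetical, and the input is limited to three columns of the six rows shown in Table 1.

```python
def column_report(rows):
    """Count missing values and distinct values for every column."""
    report = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        present = [v for v in values if v is not None]
        report[name] = {
            "n_missing": len(values) - len(present),
            "n_unique": len(set(present)),
        }
    return report

# Three columns of the six rows in Table 1; None would mark a missing value.
rows = [
    {"school": "GP", "famsize": "GT3", "Mjob": "at_home"},
    {"school": "GP", "famsize": "GT3", "Mjob": "at_home"},
    {"school": "GP", "famsize": "LE3", "Mjob": "at_home"},
    {"school": "GP", "famsize": "GT3", "Mjob": "health"},
    {"school": "GP", "famsize": "GT3", "Mjob": "other"},
    {"school": "GP", "famsize": "LE3", "Mjob": "services"},
]
report = column_report(rows)
for name, stats in report.items():
    print(name, stats)
```

On the full data set, an unexpectedly large `n_unique` for a categorical column would be the signal of an irregular cardinality.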

Exploratory analysis

Question 1

The question we are trying to answer: which variables have the least and the greatest impact on a student’s academic performance?

Correlations

In this graph, the correlations between the grades are high, as the author of the data set stated. Therefore, we will not use all three target variables (G1, G2, and G3); for the analysis we will use only G3, which is the final grade.
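
The pairwise correlation between the grades can be computed directly from its definition. The sketch below is a plain Python illustration using only the six rows of G1, G2, and G3 shown in Table 1; the function name `pearson` is a hypothetical helper, and the values it prints are for this small sample only, not for the full data set.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Grades of the six students shown in Table 1.
g1 = [5, 5, 7, 15, 6, 15]
g2 = [6, 5, 8, 14, 10, 15]
g3 = [6, 6, 10, 15, 10, 15]

for name, grades in (("G1", g1), ("G2", g2)):
    print(name, "vs G3:", round(pearson(grades, g3), 3))
```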

In order to look at all variables, attributes are separated into 4 domains:

  • Family related: famsize, Pstatus, Medu, Fedu, Mjob, Fjob, guardian, famrel, and famsup
  • Entertainment related: Dalc, Walc, freetime, romantic, goout, and traveltime
  • Academic related: studytime, activities, schoolsup, absences, failures, nursery, and higher
  • Other attributes related to students: sex, age, address, paid, internet, and health

This shows correlations of G3 with family related variables.

This shows correlations of G3 with entertainment related variables.

This shows correlations of G3 with academic related variables such as higher and failures.

This shows correlations of G3 with other attributes of students such as age.

Table 5. First 6 Rows of Data Converted to Numeric Values
school sex age address famsize Pstatus Medu Fedu
1 1 18 2 1 1 4 4
1 1 17 2 1 2 1 1
1 1 15 2 2 2 1 1
1 1 15 2 1 2 4 2
1 1 16 2 1 2 3 3
1 2 16 2 2 2 4 3
Mjob Fjob reason guardian traveltime studytime failures schoolsup
1 5 1 2 2 2 0 2
1 3 1 1 1 2 0 1
1 3 3 2 1 2 3 2
2 4 2 2 1 3 0 1
3 3 2 1 1 2 0 1
4 3 4 2 1 2 0 1
famsup paid activities nursery higher internet romantic famrel
1 1 1 2 2 1 1 4
2 1 1 1 2 2 1 5
1 2 1 2 2 2 1 4
2 2 2 2 2 2 2 3
2 2 1 2 2 1 1 4
2 2 2 2 2 2 1 5
freetime goout Dalc Walc health absences G1 G2 G3
3 4 1 1 3 6 5 6 6
3 3 1 1 3 4 5 5 6
3 2 2 3 3 10 7 8 10
2 2 1 1 5 2 15 14 15
3 2 1 2 5 4 6 10 10
4 2 1 2 5 10 15 15 15
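
The codes in Table 5 are consistent with giving each category its position in alphabetical order (for example, Pstatus A becomes 1 and T becomes 2). That is an inference from Tables 1 and 5, not a description of the code actually used. A minimal Python sketch of such an encoding follows; the function `encode` is hypothetical, and the list of Mjob levels is taken from the values seen in the tables.

```python
def encode(values, levels=None):
    """Map each category to its 1-based position among the sorted levels."""
    if levels is None:
        levels = set(values)
    codes = {level: i for i, level in enumerate(sorted(levels), start=1)}
    return [codes[v] for v in values]

# First six rows of three columns from Table 1.
mjob = ["at_home", "at_home", "at_home", "health", "other", "services"]
pstatus = ["A", "T", "T", "T", "T", "T"]
famsup = ["no", "yes", "no", "yes", "yes", "yes"]

# All five levels are listed so that a level absent from the sample keeps its code.
mjob_levels = ["at_home", "health", "other", "services", "teacher"]

print(encode(mjob, mjob_levels))
print(encode(pstatus))
print(encode(famsup))
```

For these three columns the output agrees with the corresponding rows of Table 5.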

This shows correlations of G3 with family related variables such as mother’s and father’s education. G3 has the highest correlation with the level of the mother’s education.

This shows correlations of G3 with entertainment related variables such as goout and traveltime. G3 has the highest correlation with the amount of time spent going out.

This shows correlations of G3 with academic related variables such as higher and failures. G3 has the highest correlation with the failures variable.

This shows correlations of G3 with other attributes of students such as age. Here G3 has the highest correlation with the student’s age.

Dimensionality Reduction

The data set contains many variables; hence, we analyzed them with principal component analysis (PCA).

Table 6. PCA Resulting PCs (First 5)
PC1 PC2 PC3 PC4 PC5
school 0.0034393 0.0035950 -0.0146910 0.0533958 -0.0050823
sex 0.0042130 -0.0088727 -0.1283385 -0.0076321 -0.0239502
age -0.0289796 0.0314500 -0.0756241 0.3983161 -0.0921604
address 0.0015917 -0.0094165 0.0026950 -0.0474724 0.0193215
famsize -0.0019793 -0.0073368 -0.0286624 0.0230301 0.0129049
Pstatus 0.0051238 0.0026267 0.0008888 0.0174341 -0.0066342
Medu -0.0133557 -0.0561666 -0.1191241 -0.5107038 0.0193417
Fedu -0.0029806 -0.0460025 -0.1174808 -0.4423881 -0.0448584
Mjob -0.0076591 -0.0284290 -0.1975051 -0.4825205 -0.0734298
Fjob -0.0008364 -0.0126785 -0.0868130 -0.1217384 0.0135957
reason -0.0176413 -0.0329332 0.1507109 -0.0928095 0.2591892
guardian -0.0111539 0.0073716 0.0151965 0.0519465 -0.0199573
traveltime 0.0008689 0.0204531 -0.0344538 0.0996479 -0.0221689
studytime 0.0069854 -0.0273182 0.1639636 -0.0271974 -0.0092362
failures -0.0067216 0.0584286 -0.0538223 0.1201256 -0.0032652
schoolsup -0.0010061 0.0109059 0.0260322 -0.0335882 0.0355734
famsup -0.0015418 0.0062674 0.0193152 -0.0823382 0.0104148
paid -0.0004285 -0.0089173 -0.0070881 -0.0455181 0.0635420
activities 0.0009319 -0.0062248 -0.0054827 -0.0509431 -0.0060290
nursery -0.0008824 -0.0066744 0.0123192 -0.0509694 -0.0014797
higher 0.0016581 -0.0086730 0.0089801 -0.0291958 0.0003309
internet -0.0046603 -0.0088330 -0.0166627 -0.0530546 0.0287127
romantic -0.0091878 0.0073550 0.0098075 0.0151318 -0.0379199
famrel 0.0050636 0.0005706 0.0003452 -0.0122973 -0.1384288
freetime 0.0071714 0.0024534 -0.2527927 0.0136479 -0.0566476
goout -0.0070293 0.0394399 -0.3925895 0.0615628 0.1144059
Dalc -0.0130810 0.0159337 -0.3753694 0.0817900 0.1035887
Walc -0.0230123 0.0309674 -0.6358956 0.1913719 0.2163594
health 0.0050971 0.0290182 -0.2784197 -0.0698200 -0.6409292
absences -0.9981818 -0.0321442 0.0214435 -0.0021858 -0.0133280
G1 0.0204953 -0.6473123 0.0128437 0.1528869 -0.4902631
G2 0.0238063 -0.7500806 -0.0716886 0.0005639 0.4026845
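
Loadings such as those in Table 6 are the eigenvectors of the covariance (or correlation) matrix of the data. As a rough illustration of how the first component is obtained, the Python sketch below runs power iteration on the covariance matrix of three columns (absences, G1, G2) from the six rows in Table 1. It assumes unscaled variables, uses hypothetical helper names, and works on a tiny sample, so its loadings are not expected to match Table 6.

```python
import math

def covariance_matrix(columns):
    """Sample covariance matrix of a list of equally long columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
        for ci in centered
    ]

def first_principal_component(cov, iterations=500):
    """Approximate the leading eigenvector and eigenvalue by power iteration."""
    size = len(cov)
    vec = [1.0] * size
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient: variance of the data along the unit vector `vec`.
    projected = [sum(cov[i][j] * vec[j] for j in range(size)) for i in range(size)]
    eigenvalue = sum(v * p for v, p in zip(vec, projected))
    return vec, eigenvalue

# Six rows from Table 1, unscaled (an assumption about the original analysis).
absences = [6, 4, 10, 2, 4, 10]
g1 = [5, 5, 7, 15, 6, 15]
g2 = [6, 5, 8, 14, 10, 15]

cov = covariance_matrix([absences, g1, g2])
loadings, variance = first_principal_component(cov)
print([round(w, 3) for w in loadings], round(variance, 2))
```

Without scaling, the variable with the largest variance tends to dominate the first component, which is worth keeping in mind when reading the absences loading on PC1.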

One interesting finding is that the number of absences dominates PC1, with a loading of about -0.998, which we read as a significant negative impact on G3.

Because absences showed only a small correlation with G3 in the ggpairs correlation plots but a large loading in the PCA, absences alone does not tell us much about students and their grades; combined with other attributes, however, it can tell us considerably more.

However, note that when the data set was converted into numeric values, some variables were assigned numbers that bear no relation to their meaning. For instance, the Mjob value “at_home” is assigned 1 and the other values are assigned other numbers. These categories have no natural order; hence, tools like PCA may perform poorly or give misleading results.
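
One common way to avoid imposing an arbitrary order on a nominal variable such as Mjob is one-hot (indicator) encoding, in which every category gets its own 0/1 column. This is an alternative technique shown for illustration only; it is not what was done in this report. The helper `one_hot` is hypothetical, and the sample values are the six Mjob entries from Table 1.

```python
def one_hot(values, levels):
    """Return one 0/1 indicator column per level."""
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels
    }

mjob = ["at_home", "at_home", "at_home", "health", "other", "services"]
mjob_levels = ["at_home", "health", "other", "services", "teacher"]

indicators = one_hot(mjob, mjob_levels)
for level, column in indicators.items():
    print(f"Mjob_{level}:", column)
```

Each row then has exactly one indicator set to 1, so no category is treated as larger or smaller than another.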

Question 2

The second question we are trying to answer:

  • Does the relationship status of a student’s family have an impact on their grades? If so, how do they relate?

First, we look at the variables that relate to the availability of family support.

Data Filtering

Only the variables related to family attributes are selected.

First 6 rows of the selected table:

Table 7. First 6 Rows of Selected Variables
famsize Pstatus Medu Fedu Mjob Fjob famsup famrel guardian age G3
GT3 A 4 4 at_home teacher no 4 mother 18 6
GT3 T 1 1 at_home other yes 5 father 17 6
LE3 T 1 1 at_home other no 4 mother 15 10
GT3 T 4 2 health services yes 3 mother 15 15
GT3 T 3 3 other other yes 4 father 16 10
LE3 T 4 3 services other yes 5 mother 16 15

Plots

Scatter plots of some of the variables were generated; however, because all of the selected variables except age and G3 are categorical, with five or fewer distinct values each, the scatter plots are not informative.

For this reason, other methods are required.

The first approach was simply to observe the variables and see how they relate to each other.

Table 8. First 6 Rows of Pivoted Table for Occupation
famsup Pstatus Mother/Father Occupation
no A Mjob at_home
no A Fjob teacher
yes T Mjob at_home
yes T Fjob other
no T Mjob at_home
no T Fjob other
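
Table 8 is a long-format version of the data in which each student contributes one row per parent. A minimal Python sketch of that reshaping is shown below, using the first three students from Table 7; the function `pivot_occupations` and the output key names are hypothetical.

```python
def pivot_occupations(rows):
    """Turn the Mjob and Fjob columns into one row per parent."""
    long_rows = []
    for row in rows:
        for parent in ("Mjob", "Fjob"):
            long_rows.append({
                "famsup": row["famsup"],
                "Pstatus": row["Pstatus"],
                "parent": parent,
                "occupation": row[parent],
            })
    return long_rows

# First three students from Table 7.
students = [
    {"famsup": "no", "Pstatus": "A", "Mjob": "at_home", "Fjob": "teacher"},
    {"famsup": "yes", "Pstatus": "T", "Mjob": "at_home", "Fjob": "other"},
    {"famsup": "no", "Pstatus": "T", "Mjob": "at_home", "Fjob": "other"},
]

long_table = pivot_occupations(students)
for entry in long_table:
    print(entry["famsup"], entry["Pstatus"], entry["parent"], entry["occupation"])
```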

From the visualization, the occupation “other” dominates. In general, whatever the occupation of the parents, there are more supportive parents than unsupportive ones.

Next, the Pstatus variable is taken into account, which tells us whether the parents live together or apart.

Because there are many more supportive parents, it is hard to see the proportion of unsupportive parents, so the two groups are now shown separately.
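
When one group is much larger than the other, comparing shares within each group can be easier than comparing raw counts. The Python sketch below computes the share of students with family support for each Pstatus value, using only the six rows of Table 7; `support_share` is a hypothetical helper, and this is one possible way to make the comparison rather than the method behind the report's figures.

```python
from collections import Counter

def support_share(pairs):
    """Share of students with famsup == 'yes' within each Pstatus group."""
    counts = Counter(pairs)
    totals = Counter(pstatus for pstatus, _ in pairs)
    return {
        pstatus: counts[(pstatus, "yes")] / total
        for pstatus, total in totals.items()
    }

# (Pstatus, famsup) for the six students in Table 7.
pairs = [
    ("A", "no"), ("T", "yes"), ("T", "no"),
    ("T", "yes"), ("T", "yes"), ("T", "yes"),
]

shares = support_share(pairs)
print(shares)
```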

The second approach was PCA, as there are so many variables.

Because PCA accepts only numeric values, all categorical or character values must be converted to numeric values.

Table 9. PCA Resulting PCs for Family Related Attributes
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
famsize 0.0081065 -0.0198859 -0.0548274 -0.0296140 -0.0576411 -0.0765147 -0.2442095 -0.4945040 0.8071575 0.1755080
Pstatus 0.0200118 0.0038622 -0.0109463 0.0120066 0.0340096 -0.0223011 -0.0362613 0.1717773 -0.1180834 0.9761663
Medu -0.5706330 -0.1411673 0.2665410 -0.0283997 -0.2297156 0.7024299 -0.1878314 0.0009441 0.0060222 0.0332307
Fedu -0.5009064 -0.0446285 0.5967171 0.0588191 0.0282867 -0.5989000 0.1205548 -0.1069176 -0.0415116 0.0200166
Mjob -0.5298475 -0.3610184 -0.7126036 -0.0881186 -0.0741394 -0.2501915 0.0352549 0.0444049 -0.0442618 -0.0096079
Fjob -0.1632671 -0.0789846 -0.0153457 0.0601673 0.9581279 0.1855031 0.0474516 -0.0232121 0.0891257 -0.0097672
famsup -0.0626616 0.0390097 0.0558964 -0.0038466 -0.0682508 0.0237808 0.4290888 0.6941433 0.5649230 -0.0331478
famrel 0.0003437 -0.0959274 -0.0749396 0.9889580 -0.0759743 0.0192720 -0.0205265 0.0028039 0.0224926 -0.0080796
guardian 0.0489614 -0.1369487 -0.0161683 -0.0047194 -0.0895281 0.2070491 0.8319369 -0.4687818 -0.0522571 0.1143387
age 0.3305847 -0.9008985 0.2296031 -0.0724747 0.0004642 -0.0273072 -0.0952924 0.1020605 0.0227234 -0.0191374

From the table, PC1 is dominated by Medu, Fedu, and Mjob.

This is interesting because the plots tell us that the occupation of mothers matters much more than the occupation of fathers. Referring back to figure x, many mothers stay at home and can support their children fully. We expect this to be one of the reasons.

Linear Model

Now, linear models will be used to see the trends of G3 over age.

From this visualization, notice that the mean grade of students aged 20 is extremely high. This is because only 3 students are 20 years old. For ages 21 and 22, there is only one student each. Hence, we will ignore these 5 instances by keeping only the instances with age less than 20.
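
The filtering and per-age averaging described above can be sketched in Python as follows. The function `mean_grade_by_age` and its `age_limit` parameter are hypothetical, and the input is the six (age, G3) pairs from Table 7, none of which is aged 20 or older, so the filter removes nothing in this small sample.

```python
from collections import defaultdict
from statistics import mean

def mean_grade_by_age(records, age_limit=None):
    """Mean grade per age, keeping only ages strictly below age_limit."""
    groups = defaultdict(list)
    for age, grade in records:
        if age_limit is None or age < age_limit:
            groups[age].append(grade)
    return {age: mean(grades) for age, grades in sorted(groups.items())}

# (age, G3) for the six students in Table 7.
records = [(18, 6), (17, 6), (15, 10), (15, 15), (16, 10), (16, 15)]

by_age = mean_grade_by_age(records, age_limit=20)
print(by_age)
```

Groups with very few students, such as the older ages in the full data set, produce unstable means, which is the reason for dropping them.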

Both the mean and the median grade decrease somewhat with age.

Consequently, a simple linear fit also shows a decreasing trend.
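
A simple linear fit of G3 on age reduces to the ordinary least squares formulas for slope and intercept. The Python sketch below applies them to the six (age, G3) pairs from Table 7; `linear_fit` is a hypothetical helper, and a slope estimated from six rows says little about the full data set.

```python
def linear_fit(x, y):
    """Ordinary least squares fit y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Ages and final grades of the six students in Table 7.
ages = [18, 17, 15, 15, 16, 16]
g3 = [6, 6, 10, 15, 10, 15]

slope, intercept = linear_fit(ages, g3)
print(round(slope, 3), round(intercept, 3))
```

A negative slope corresponds to grades decreasing with age.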

The 25%, 50%, and 75% quantiles all show a decreasing trend.

Figures x and y show that the grades of students with no family support decrease more steeply with age. Figure y also tells us that, among students in the lower 25%, those with family support tend to do better with age, whereas those without support keep declining.

Earlier, we noticed that the occupation of mothers matters significantly more than the occupation of fathers. Hence, color is mapped to the mother’s occupation to see why that is the case.

From both figures x and y, we can observe that students whose mothers stay at home get better grades over time. This is expected, as mothers who stay at home can spend more time with their children and give them more care.

Question 3

Last question we are trying to answer: What is the correlation between alcohol consumption and variables that impact academic performance such as hours studying, number of absences, number of class failures?
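
Because alcohol consumption and study time are recorded on ordinal 1 to 5 and 1 to 4 scales, a rank-based measure such as Spearman's correlation is one reasonable choice for this question. The Python sketch below is an illustration of that measure, not the method used in the report; the helper names are hypothetical, and the data are the six rows of Walc, studytime, absences, and failures from Table 1.

```python
import math

def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)

# Six rows from Table 1.
walc = [1, 1, 3, 1, 2, 2]
studytime = [2, 2, 2, 3, 2, 2]
absences = [6, 4, 10, 2, 4, 10]
failures = [0, 0, 3, 0, 0, 0]

for name, column in (("studytime", studytime), ("absences", absences), ("failures", failures)):
    print("Walc vs", name, round(spearman(walc, column), 3))
```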

Final visualizations

Here are the significant data visualizations generated in order to answer question 2.

Here are the significant data visualizations generated in order to answer question 3.

Interpretation and conclusion

Student absences have the greatest negative impact on grades, and no single variable has an overtly positive effect on grades. Note that absences tell us more about students when more attributes are available.

The family life of a student does have an impact on their grades. For instance, the availability of support from parents had an impact on grades. Hence, we can conclude that family related attributes affect students’ academic performance.

Based on the visualizations, alcohol consumption has an effect on study time but not on absences: increased alcohol consumption is associated with studying less. Because study time in turn affects academic performance, we conclude that alcohol consumption has an impact on grades.

In conclusion, many factors influence students’ academic performance, and it can vary with many different attributes.